PCAP: a whole-genome assembly program.

نویسندگان

  • Xiaoqiu Huang
  • Jianmin Wang
  • Srinivas Aluru
  • Shiaw-Pyng Yang
  • LaDeana Hillier
چکیده

We describe a whole-genome assembly program named PCAP for processing tens of millions of reads. The PCAP program has several features to address efficiency and accuracy issues in assembly. Multiple processors are used to perform most time-consuming computations in assembly. A more sensitive method is used to avoid missing overlaps caused by sequencing errors. Repetitive regions of reads are detected on the basis of many overlaps with other reads, instead of many shorter word matches with other reads. Contaminated end regions of reads are identified and removed. Generation of a consensus sequence for a contig is based on an alignment of reads in the contig, in which both base quality values and coverage information are used to determine every consensus base. The PCAP program was tested on a mouse whole-genome data set of 30 million reads and a human Chromosome 20 data set of 1.7 million reads. The program is freely available for academic use.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Generating a genome assembly with PCAP.

This unit describes how to use the Parallel Contig Assembly Program (PCAP) to assemble the data produced by a whole-genome shotgun sequencing project. We present a basic protocol for using PCAP on a multiprocessor computer in a 300-Mb genome assembly project. A support protocol to prepare input files for PCAP is also described. Another basic protocol for using PCAP on a distributed cluster of c...

متن کامل

Application of a superword array in genome assembly

We introduce a data structure called a superword array for finding quickly matches between DNA sequences. The superword array possesses some desirable features of the lookup table and suffix array. We describe simple algorithms for constructing and using a superword array to find pairs of sequences that share a unique superword. The algorithms are implemented in a genome assembly program called...

متن کامل

Fosmid-based physical mapping of the Histoplasma capsulatum genome.

A fosmid library representing 10-fold coverage of the Histoplasma capsulatum G217B genome was used to construct a restriction-based physical map. The data obtained from three restriction endonuclease fingerprints, generated from each clone using BamHI, HindIII, and PstI endonucleases, were combined and used in FPC for automatic and manual contig assembly builds. Concomitantly, a whole-genome sh...

متن کامل

Physical map-assisted whole-genome shotgun sequence assemblies.

We describe a targeted approach to improve the contiguity of whole-genome shotgun sequence (WGS) assemblies at run-time, using information from Bacterial Artificial Chromosome (BAC)-based physical maps. Clone sizes and overlaps derived from clone fingerprints are used for the calculation of length constraints between any two BAC neighbors sharing 40% of their size. These constraints are used to...

متن کامل

The Genome Sequence of the Fungal Pathogen Fusarium virguliforme That Causes Sudden Death Syndrome in Soybean

UNLABELLED Fusarium virguliforme causes sudden death syndrome (SDS) of soybean, a disease of serious concern throughout most of the soybean producing regions of the world. Despite the global importance, little is known about the pathogenesis mechanisms of F. virguliforme. Thus, we applied Next-Generation DNA Sequencing to reveal the draft F. virguliforme genome sequence and identified putative ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Genome research

دوره 13 9  شماره 

صفحات  -

تاریخ انتشار 2003